Week 3: More on Mixed Effects Models

Agenda

  • Recapping last week
  • Discussing the assumptions of LMEs
  • Should variables be fixed or random effects?
  • Some pitfalls.
library(tidyverse)
library(broom)
library(ggplot2)
library(ggthemes)
library(lme4)
library(multivariate2017)
ayT <- ay %>% 
          filter(plt_vclass == "ay0") %>%
          mutate(dob = year-age,
                 decade = (dob - 1900)/10,
                 log2dur = log2(dur),
                 dur_c = log2dur - median(log2dur)) 

Recapping LME

Let’s recap what we learned about mixed effects models last week. First, it is reasonable to want to know what the relationship between duration and the F1 of /ay/ is:

ayT %>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')+
    theme_bw()

But if we fit just a flat model, there are many different violations of statistical assumptions, and the model is overconfident.

ayT_flat_model <- lm(F1_n ~ dur_c, data = ayT)
summary(ayT_flat_model)

Call:
lm(formula = F1_n ~ dur_c, data = ayT)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4731 -0.3900 -0.0324  0.3498  5.2217 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.487737   0.004161  117.22   <2e-16 ***
dur_c       0.435710   0.007068   61.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.578 on 19340 degrees of freedom
Multiple R-squared:  0.1642,    Adjusted R-squared:  0.1642 
F-statistic:  3800 on 1 and 19340 DF,  p-value: < 2.2e-16

There is a lot of structure to the data that is being overlooked in estimating the relationship between duration and F1. First of all, it doesn’t look like every speaker has the same relationship.

random_speakers <- sample(unique(ayT$idstring), 8)
ayT%>%
  filter(idstring %in% random_speakers)%>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = lm)+
    facet_wrap(~idstring, ncol= 4)+
    theme_bw()

Second, the data is far from balanced across speakers.

ayT %>%
  count(idstring)%>%
  ggplot(aes(n))+
    stat_bin()+
    scale_x_log10()+
    xlab("n (log scale)")+
    theme_bw()

Let’s look again at our estimates from the flat model:

summary(ayT_flat_model)$coef
             Estimate  Std. Error   t value Pr(>|t|)
(Intercept) 0.4877373 0.004160927 117.21842        0
dur_c       0.4357102 0.007067868  61.64662        0

Those estimates are going to largely be a characterization of the relationship between duration and F1 for the speakers with the most data. One way you could try to get around this problem would be to fit one linear model for every speaker.

ayT%>%
  ggplot(aes(dur_c, F1_n))+
    stat_smooth(method = lm, 
                aes(group = idstring), 
                se = F,
                geom = "line",
                alpha = 0.4)+
    theme_bw()

But this isn’t ideal for a few reasons.

  • We don’t wind up with a precise estimate of the general effect of duration on F1.
  • We don’t wind up with an estimate of our uncertainty of the effect of duration on F1.
  • None of the estimates of these models takes into account the estimates of any other model.

So, instead, we fit a mixed effects model with a random intercept and a random slope by speaker.

ayT_lme <- lmer(F1_n ~ dur_c + (1 + dur_c | idstring), data = ayT)
summary(ayT_lme)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ dur_c + (1 + dur_c | idstring)
   Data: ayT

REML criterion at convergence: 27306

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.6543 -0.6010 -0.0430  0.5386 10.3140 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 idstring (Intercept) 0.10060  0.3172        
          dur_c       0.01769  0.1330   -0.12
 Residual             0.22564  0.4750        
Number of obs: 19342, groups:  idstring, 326

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.58113    0.01827   31.81
dur_c        0.34284    0.01071   32.00

Correlation of Fixed Effects:
      (Intr)
dur_c -0.092

Here, we have the fixed effects estimates:

summary(ayT_lme)$coef
             Estimate Std. Error  t value
(Intercept) 0.5811306  0.0182693 31.80913
dur_c       0.3428392  0.0107126 32.00335

This is the best estimate across speakers of the relationship between F1 and duration. But we also get some valuable info about how much variability around these estimates we can expect from individual speakers!

summary(ayT_lme)$varcor
 Groups   Name        Std.Dev. Corr  
 idstring (Intercept) 0.31718        
          dur_c       0.13301  -0.120
 Residual             0.47502        

The standard deviation of the random slopes here is fairly respectable. Let’s say there was a speaker who had double the slope of the group estimate. How many standard deviations would they be from the group average?

slope <- 0.3428392 
slope_sd <- 0.13301
((slope * 2) - slope)/slope_sd
[1] 2.577545

Two and a half sds is fairly large, but not an extreme outlier.
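Under the normality assumption on the random slopes, we can even put a number on how rare such a speaker would be. A quick sketch, reusing the estimates from above:

```r
# If speakers' slopes are normal with the estimated mean and sd,
# this is the proportion of speakers we'd expect to have at least
# double the group slope:
slope <- 0.3428392
slope_sd <- 0.13301
pnorm(slope * 2, mean = slope, sd = slope_sd, lower.tail = FALSE)
```

This comes out to roughly 0.005, i.e. about one speaker in 200.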

The assumptions of Mixed Effects Models

Mixed effects models relax some assumptions of fixed-effects models, namely that all observations are i.i.d. They do so by allowing for partial pooling of the data according to grouping factors. However, mixed effects models are not without assumptions of their own.

Specifically, the model assumes that the random effects are drawn from a normal distribution. The clue that the model is making this assumption is that it gives us estimates of the standard deviation of the random effects, which is a parameter of the normal distribution.
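One way to eyeball this assumption (my sketch, not from the original notes, using lme4’s built-in sleepstudy data so it’s self-contained) is to Q-Q plot the predicted random effects:

```r
library(lme4)

# Fit a random intercept + slope model to lme4's built-in sleepstudy data
mod <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

# Under the normality assumption, the predicted random effects should
# fall close to the straight reference line in a Q-Q plot:
re <- ranef(mod)$Subject
qqnorm(re[["(Intercept)"]]); qqline(re[["(Intercept)"]])
qqnorm(re[["Days"]]);        qqline(re[["Days"]])
```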

This assumption has a good effect when it comes to predicting the slopes and intercepts for individual speakers. Last week, we looked at the results of fitting one linear model for each speaker. Here’s the density plots of the distribution of all of the slopes and intercepts from all of those models.

Now, let’s compare these by-speaker intercepts and slopes (in grey) to the random effects from our linear mixed effects model (in yellow).

You can see that for both the intercept and the slope, but especially the slope, the range of interspeaker values has been sucked in. This is also known as “shrinkage.” If we were interested in how different the effect of duration could be between speakers, this would suggest that simply fitting a model for every speaker would overestimate the variance!

Shrinkage is desirable for a few reasons. The first is that extreme outlier subjects will often be less extreme on a second testing (i.e. regression to the mean), so it makes sense to moderate the estimates of outlier speakers. The second is that we ought to have higher uncertainty about many of the speakers who are estimated to have an extreme duration effect than about those speakers whose slopes are closer to the group average.

To demonstrate this point, here are the six speakers who had the biggest difference between the Maximum Likelihood estimate of their slopes (that is, from fitting a model for each speaker) and their predicted slopes from the random effects. The Maximum Likelihood estimate is the dashed black line, and the random effects predictions are in the red lines.

Compare this to the speakers whose ML estimates are the most similar to the random effects predictions.
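To see shrinkage numerically in a self-contained way (again a sketch with lme4’s built-in sleepstudy data, not the ayT data), we can compare the spread of the no-pooling slopes to the partially pooled ones:

```r
library(lme4)

# No pooling: fit a separate lm for each subject and keep the slope
no_pool <- sapply(split(sleepstudy, sleepstudy$Subject),
                  function(d) coef(lm(Reaction ~ Days, data = d))["Days"])

# Partial pooling: per-subject slopes predicted by the mixed model
mod <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
partial_pool <- coef(mod)$Subject[["Days"]]

# Shrinkage: the partially pooled slopes are less spread out
sd(no_pool)
sd(partial_pool)
```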

The assumption that the random effects are drawn from a normal distribution is great because

  1. Shrinkage is good.
  2. It makes their estimation tractable on most laptops!

But as with all statistical assumptions, some caution should be taken where they may be violated. The most major, and likely most common, violation of this normality assumption is if all of the levels of your grouping factor aren’t drawn from a single distribution! For example, here is some simulated data from 6 speakers. Speakers a, b and c come from a group distribution with mean 0, and speakers d, e and f come from a group distribution with mean 4.

speakers <- letters[1:6]
speaker_means <- c(rnorm(3, mean = 0, sd = 1),
                   rnorm(3, mean = 4, sd = 1))
speaker_samps <- map(speaker_means, ~rnorm(50, mean = .x, sd = 1)) %>% simplify()
fake_speakers <- data.frame(speaker = rep(speakers, each = 50),
                            group = rep(c("x", "y"), each = (50 * 3)),
                            outcome = speaker_samps)
ggplot(fake_speakers, aes(outcome, fill = group))+
  stat_bin(position = "identity", alpha = 0.8)+
  scale_fill_colorblind()+
  theme_bw()


We could fit a mixed effects model with just a random intercept by speaker.

fake_lme_1 <- lmer(outcome ~ (1|speaker), data = fake_speakers)
summary(fake_lme_1)
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ (1 | speaker)
   Data: fake_speakers

REML criterion at convergence: 878.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.80366 -0.69879  0.00362  0.59143  2.88948 

Random effects:
 Groups   Name        Variance Std.Dev.
 speaker  (Intercept) 11.5513  3.3987  
 Residual              0.9735  0.9867  
Number of obs: 300, groups:  speaker, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)    2.574      1.389   1.853

Question

What’s the problem?

The LME is estimating just 1 variance parameter for all of the speakers, thus assuming all speakers are drawn from 1 distribution, but there are actually two distributions of speakers here!

There is a fairly straightforward way to address this violation of the LME assumption, and that’s to include the relevant grouping predictor in the model!

fake_lme_2 <- lmer(outcome ~ group + (1|speaker), 
                   data = fake_speakers)
summary(fake_lme_2)
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ group + (1 | speaker)
   Data: fake_speakers

REML criterion at convergence: 865.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.79273 -0.69862  0.00801  0.59735  2.86336 

Random effects:
 Groups   Name        Variance Std.Dev.
 speaker  (Intercept) 1.5118   1.2295  
 Residual             0.9735   0.9867  
Number of obs: 300, groups:  speaker, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -0.3626     0.7144  -0.508
groupy        5.8725     1.0104   5.812

Correlation of Fixed Effects:
       (Intr)
groupy -0.707

Now that we have the relevant between-speaker predictor included in the model, notice how much the variance of the random-intercepts has decreased.

This brings up an important point: if there is a crucial between-speakers, between-words, between-subjects, between-items, between-whatever predictor, include it in the model.

Returning to the real data example of ayT, there is one major interspeaker variable which was not included in the model: date of birth. Plotting F1 against both decade and duration, there seem to be reliable effects. But because date of birth is a between-speakers predictor, when it wasn’t included in the model, this between-speaker variation was accounted for by the by-speaker random intercepts.

ayT_lme_multi <- lmer(F1_n ~ decade * dur_c + (1 + dur_c | idstring),
                      data = ayT)
summary(ayT_lme_multi)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ decade * dur_c + (1 + dur_c | idstring)
   Data: ayT

REML criterion at convergence: 26966.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.6483 -0.6007 -0.0478  0.5410 10.2455 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 idstring (Intercept) 0.02879  0.1697       
          dur_c       0.01646  0.1283   0.19
 Residual             0.22576  0.4751       
Number of obs: 19342, groups:  idstring, 326

Fixed effects:
              Estimate Std. Error t value
(Intercept)   1.099488   0.023338   47.11
decade       -0.116161   0.004603  -25.23
dur_c         0.267381   0.023780   11.24
decade:dur_c  0.015591   0.004517    3.45

Correlation of Fixed Effects:
            (Intr) decade dur_c 
decade      -0.892              
dur_c        0.032 -0.037       
decade:dr_c -0.040  0.073 -0.898

Violations of assumptions that can’t be accommodated

There are a few circumstances that might (definitely) arise in your actual data which violate the assumptions of mixed effects models, and which we can’t do anything about within this framework.

Different group variances

First, let’s revisit why adding the between-speaker grouping predictor was effective in satisfying the normality assumption.

fake_lme_2
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ group + (1 | speaker)
   Data: fake_speakers
REML criterion at convergence: 865.1792
Random effects:
 Groups   Name        Std.Dev.
 speaker  (Intercept) 1.2295  
 Residual             0.9867  
Number of obs: 300, groups:  speaker, 6
Fixed Effects:
(Intercept)       groupy  
    -0.3626       5.8725  

The way to think about the random intercepts is that after taking into account all of the fixed effects, they’re how much remains to be explained by the pooling variables. We can visualize how the random effects structure “sees” the data by simply subtracting out the fixed effect of group y from their outcome data.

fake_speakers$outcome2 <- fake_speakers$outcome
fake_speakers$outcome2[fake_speakers$group == "y"] <- fake_speakers$outcome[fake_speakers$group == "y"] - fixef(fake_lme_2)[2]
ggplot(fake_speakers, aes(outcome2))+
  stat_bin(aes(fill = group), position = "identity", alpha= 0.6)+
  scale_fill_colorblind()+
  theme_bw()

By subtracting out the effect of group y, we align the distributions of data from groups x and y on top of each other, and then the random effects are estimated as if this were just one distribution.

But two groups of speakers can differ in more than just their means! For example, it’s possible that these two groups of speakers differ not only in their means, but also in their variances.

speakers <- letters[1:6]
speaker_means_2 <- c(rnorm(3, mean = 0, sd = 1),
                     rnorm(3, mean = 4, sd = 3))
# Here, speaker group 2 is sampled from a distribution with sd = 3
# instead of sd = 1, like it was in the first example.
speaker_samps_2 <- map(speaker_means_2, ~rnorm(50, mean = .x, sd = 1)) %>% simplify()
fake_speakers_2 <- data.frame(speaker = rep(speakers, each = 50),
                              group = rep(c("x", "y"), each = (50 * 3)),
                              outcome = speaker_samps_2)

You can fit an LME to this data, but the fact that group y has a much larger variance than group x won’t be accounted for. The variance estimates for the random intercepts will wind up splitting the difference between the actual variances of groups x and y.

summary(lmer(outcome ~ group + (1|speaker), data = fake_speakers_2))
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ group + (1 | speaker)
   Data: fake_speakers_2

REML criterion at convergence: 893.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.58673 -0.64736  0.00081  0.65741  2.90522 

Random effects:
 Groups   Name        Variance Std.Dev.
 speaker  (Intercept) 9.210    3.035   
 Residual             1.046    1.023   
Number of obs: 300, groups:  speaker, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   -1.019      1.754  -0.581
groupy         6.668      2.481   2.688

Correlation of Fixed Effects:
       (Intr)
groupy -0.707

Different variances within grouping factors

Let’s fit a random-intercepts-only model to the ayT data.

ayT_intercept_mod <- lmer(F1_n ~ (1 | idstring), data = ayT)
summary(ayT_intercept_mod)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ (1 | idstring)
   Data: ayT

REML criterion at convergence: 30275.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1576 -0.6098 -0.0522  0.5510  9.9403 

Random effects:
 Groups   Name        Variance Std.Dev.
 idstring (Intercept) 0.1235   0.3515  
 Residual             0.2662   0.5160  
Number of obs: 19342, groups:  idstring, 326

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.58630    0.02016   29.08

Let’s think about how this model is modelling the data. First, it’s assuming that the speaker random intercepts are being drawn from a normal distribution. The mean of that distribution is the fixed effects intercept estimate, 0.59. The standard deviation of this distribution is estimated to be 0.35. So, we could treat individual speakers’ means as being drawn from a normal distribution with \(\mu = 0.59\) and \(\sigma=0.35\).

But, every speaker contributed multiple data points to the entire data set. In addition to a \(\mu\), every speaker must also have a \(\sigma\). Does the LME estimate a \(\sigma\) for each speaker?

In fact, the standard deviation of the residuals could be thought of as the \(\sigma\) for each speaker. So, no, the LME does not estimate a \(\sigma\) for every speaker. Rather, it estimates 1 \(\sigma\) for all speakers.

Now, strictly speaking, this assumption that every speaker will have the same \(\sigma\) is almost certainly not going to be true! Here’s what it looks like if we estimate the standard deviation for every speaker with a reasonable amount of data (to make sure we’re getting fairly reliable estimates), compared to the single standard deviation of the residuals from the LME.

ayT %>%
  group_by(idstring) %>%
  filter(n() > 40) %>%
  summarise(mean_F1 = mean(F1_n),
            sd_F1 = sd(F1_n),
            n = n()) %>%
  ggplot(aes(mean_F1, sd_F1))+
    geom_hline(yintercept=0.5160, 
               color = "red",
               size = 3)+
    geom_point(aes(size = n),
               alpha = 0.6)+
    scale_size_area()+
    theme_bw()

It could be of some interest to capture the fact that not all speakers have the same \(\sigma\), and it would definitely improve the fit of the model, but this just isn’t something that can be captured with LMEs.

Non-normal data

Just like with ordinary linear regression, if your outcome variable is non-normal, it’s technically going to violate the assumptions of normality. For example, what if, instead of fitting a model where we tried to estimate how the pronunciation of /ayT/ changed, we tried to fit a model to see whether the duration of /ayT/ has changed over time?

summary(lmer(dur_c ~ decade + (1|idstring), data = ayT))
Linear mixed model fit by REML ['lmerMod']
Formula: dur_c ~ decade + (1 | idstring)
   Data: ayT

REML criterion at convergence: 32006.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6716 -0.7235 -0.0428  0.6669  5.3633 

Random effects:
 Groups   Name        Variance Std.Dev.
 idstring (Intercept) 0.03914  0.1978  
 Residual             0.29649  0.5445  
Number of obs: 19342, groups:  idstring, 326

Fixed effects:
             Estimate Std. Error t value
(Intercept)  0.222092   0.026797   8.288
decade      -0.044471   0.005299  -8.393

Correlation of Fixed Effects:
       (Intr)
decade -0.891

Here, we get a fairly reliable effect (looks like /ayT/ has gotten shorter over time), but recall that the distribution of duration looks like this!

It’s not likely that either the by-speaker random intercepts or the residuals will have a normal distribution. This is actually one scenario you can get around, because it’s principled enough to log-scale the duration outcome, but there won’t always be a principled transform for skewed data like this.
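Since the actual duration distribution isn’t shown in these notes, here’s a simulated sketch of the general shape of the problem: raw durations are right-skewed, and log-scaling makes them far more symmetric.

```r
# Simulated (not real) durations: log-normal, so right-skewed on the
# raw scale but roughly symmetric after a log2 transform
set.seed(517)
dur <- rlnorm(5000, meanlog = log(0.15), sdlog = 0.5)

par(mfrow = c(1, 2))
hist(dur,       main = "raw duration",  xlab = "seconds")
hist(log2(dur), main = "log2 duration", xlab = "log2(seconds)")
```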

Fixed Effect or Random Effect?

Before even getting to how to specify a random effects structure, people often also wonder which predictors should be included as a fixed effect, and which should be included as a random effect. For example, when looking at the relationship between duration and F1 for /ayT/, let’s suppose we had the following suspicion:

I think that different speakers might have different slopes for the relationship between duration and F1.

There are two different ways we could approach this:

  1. Fit a linear model with an interaction between duration and speaker.
  2. Fit a linear mixed effects model with a random slope of duration by speaker.

Let’s just look at the first option briefly:

ayT_lm_interaction <- lm(F1_n ~ dur_c * idstring, data = ayT)
summary(ayT_lm_interaction)

I’ve actually left the summary of this model out of the notes, because it’s too long to be of any use. What we wind up with is an intercept and slope estimate for some reference speaker, and then the difference of every other speaker’s intercept and slope from that reference speaker. The weirdness of choosing a reference-level speaker can be ameliorated by using sum contrasts, but that doesn’t address a more serious issue: this approach will wind up giving us estimates identical to the no-pooling method.

Each speaker’s coefficient and slope interaction is estimated without respect to the estimates of any other speaker. And that comes with all of the same shortcomings we dealt with in the no-pooling model.
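As a toy illustration of that sum-contrasts point (hypothetical data, not from the course data set): with treatment coding the intercept is a reference speaker’s mean, while with sum coding it’s the grand mean of the speaker means.

```r
# Three hypothetical speakers with means 1, 2, and 6 (no noise,
# so the fitted coefficients are exact)
d <- data.frame(speaker = factor(rep(c("a", "b", "c"), each = 10)),
                y = rep(c(1, 2, 6), each = 10))

# Treatment coding: intercept = mean of reference speaker "a"
coef(lm(y ~ speaker, data = d))["(Intercept)"]  # 1

# Sum coding: intercept = grand mean of the speaker means
coef(lm(y ~ speaker, data = d,
        contrasts = list(speaker = contr.sum)))["(Intercept)"]  # 3
```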

On the other hand, let’s return to the broader ay data set. Recall we’ve been looking at /ay/ in strictly pre-voiceless contexts. But in the pre-voiced context, the relationship between F1 and the speakers’ date of birth looks a lot different.

Looking at the graph, we might say:

I suspect that the relationship between F1 and decade will differ between pre-voiced and pre-voiceless contexts.

The way to model this would be to either:

  1. Fit a model with a fixed effects interaction between decade and voicing.
  2. Fit a model with a random slope of decade by voicing.

Just for illustration, let’s look at a model of this second approach:

summary(ay_lmer_slope)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ decade + (1 | idstring) + (1 + decade | plt_vclass)
   Data: ay

REML criterion at convergence: 169638.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.5863 -0.6114 -0.0119  0.5965  9.9651 

Random effects:
 Groups     Name        Variance Std.Dev. Corr
 idstring   (Intercept) 0.01120  0.10585      
 plt_vclass (Intercept) 0.01164  0.10788      
            decade      0.00608  0.07797  1.00
 Residual               0.43052  0.65614      
Number of obs: 84713, groups:  idstring, 326; plt_vclass, 2

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.25160    0.07760   16.13
decade      -0.07838    0.05521   -1.42

Correlation of Fixed Effects:
       (Intr)
decade 0.973 

We get out one fixed effects intercept and slope, and an estimate of the variance of the random slopes by voicing context. We can also look at the random effects predictions:

ranef(ay_lmer_slope)$plt_vclass

One thing that isn’t so great about this result is that shrinkage will have applied to these random slopes! That is, the predicted slopes for pre-voiced and pre-voiceless /ay/ are going to be less different here than they would be as fixed effects.


We can sum up the motivation for including a factor as a fixed or random effect based on whether or not we want shrinkage, or partial pooling, to apply between levels of the factor. Another way of putting it is whether or not we want to treat the parameter values for each level of the factor as being drawn from a population-level distribution. That is, the decision is always going to be guided by your conceptualization of the data, even though there are usually decisions that can be made by convention. Almost always, the following can be treated as random effects:

  • individuals (speakers, subjects, texts, authors, etc.)
  • items (words, specific stimuli, etc.)

And most things that you want to look at for a generalizable effect can be fixed effects.

“Nuisance variables”

Another way to think about random effects is as “nuisance variables”. When modelling something as complex as why pronunciations change, there are always going to be factors missing from our models for a variety of reasons.

  • there are things we know matter, but failed to measure them for some reason.
  • there are things we know matter, but we don’t know how to measure them.
  • there are things that may not be measurable.
  • there are things we don’t realize matter.

If we don’t do our best to account for these “unmodellables”, they could very well mess with our fixed effects estimates. While they are not a perfect solution, random effects can go some way towards soaking up some of these unmodellable variables.

On the flip side, if you have something you know has an effect, include it in the model. For example, we know that gender has a big effect on many changes in progress. You should, therefore, include it in the model instead of hoping that the random effects will account for it. That’s because if there is a big between-subject variable that affects the outcome variable, the assumption that speaker random intercepts & slopes are drawn from a single normal distribution is violated.

Pitfalls in mixed effects modelling

The biggest issue people face when fitting a mixed effects model is that it may “fail to converge”. This is because these models aren’t ‘solved’ in the same way as regular linear regression. Because of their more complicated nature, these models are usually estimated via a process of iterative optimization.

It is very important to only base your inference on converged models. There are a handful of tweaks that might help in your models converging.

  • center and scale all continuous predictors.
  • if possible, choose a different factor level that has more data as your reference level.
  • remove the correlation component from your model

Centering and scaling your continuous predictors is especially important here. Let’s look at a relatively simple model which is trying to predict F1 by date of birth, with a random intercept by speaker, and a random intercept and slope by word.

ay_fail <- lmer(F1_n ~ dob + (1 | idstring) + (1 + dob | word),
                data = ayT)
Model failed to converge with max|grad| = 54.1947 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?
summary(ay_fail)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ dob + (1 | idstring) + (1 + dob | word)
   Data: ayT

REML criterion at convergence: 28433.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.8279 -0.5969 -0.0404  0.5465  9.6958 

Random effects:
 Groups   Name        Variance  Std.Dev.  Corr 
 idstring (Intercept) 5.638e-02 0.2374365      
 word     (Intercept) 2.931e-01 0.5413468      
          dob         1.044e-07 0.0003232 -0.90
 Residual             2.410e-01 0.4909313      
Number of obs: 19342, groups:  idstring, 326; word, 255

Fixed effects:
              Estimate Std. Error t value
(Intercept) 23.4652160  1.2169449   19.28
dob         -0.0116823  0.0006259  -18.66

Correlation of Fixed Effects:
    (Intr)
dob -1.000
convergence code: 0
Model failed to converge with max|grad| = 54.1947 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

This is bad. But remember, the dob variable is looking at the data like this:

This can pose a challenge for trying to estimate random intercepts and slopes! If we refit the model, this time with the centered and scaled decade predictor, it’ll converge.

ay_win <- lmer(F1_n ~ decade + (1 | idstring) + (1 + decade | word),
                data = ayT)
summary(ay_win)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ decade + (1 | idstring) + (1 + decade | word)
   Data: ayT

REML criterion at convergence: 28302.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.9964 -0.5958 -0.0401  0.5501  9.5362 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 idstring (Intercept) 0.031817 0.17837       
 word     (Intercept) 0.064517 0.25400       
          decade      0.000555 0.02356  -0.07
 Residual             0.240948 0.49086       
Number of obs: 19342, groups:  idstring, 326; word, 255

Fixed effects:
             Estimate Std. Error t value
(Intercept)  1.239348   0.036844   33.64
decade      -0.110736   0.006342  -17.46

Correlation of Fixed Effects:
       (Intr)
decade -0.714

But sometimes, no matter how you scale or center your predictors, the random effects structure might be too complex given the data. I tried fitting a model like this with the /ayT/ data, but it actually converged:

ay_big <- lmer(F1_n ~ decade * dur_c + 
                  (1 + dur_c | idstring) + 
                  (1 + decade * dur_c | word),
               data = ayT)
summary(ay_big)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ decade * dur_c + (1 + dur_c | idstring) + (1 + decade *      dur_c | word)
   Data: ayT

REML criterion at convergence: 26105.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.4098 -0.5848 -0.0401  0.5324 10.4415 

Random effects:
 Groups   Name         Variance  Std.Dev. Corr             
 idstring (Intercept)  0.0279960 0.16732                   
          dur_c        0.0159195 0.12617  0.21             
 word     (Intercept)  0.0920593 0.30341                   
          decade       0.0006019 0.02453  -0.56            
          dur_c        0.0415790 0.20391  -0.78  0.33      
          decade:dur_c 0.0017033 0.04127   0.81 -0.38 -0.97
 Residual              0.2119319 0.46036                   
Number of obs: 19342, groups:  idstring, 326; word, 255

Fixed effects:
              Estimate Std. Error t value
(Intercept)   1.115754   0.040275  27.703
decade       -0.102282   0.006440 -15.883
dur_c         0.307936   0.042315   7.277
decade:dur_c  0.001809   0.008129   0.223

Correlation of Fixed Effects:
            (Intr) decade dur_c 
decade      -0.781              
dur_c       -0.406  0.241       
decade:dr_c  0.398 -0.245 -0.914

But often, a model like this will fail to converge, and it’s easy enough to understand why. For many levels of word, there is only 1 token, but the model is trying to estimate a random intercept, slopes for decade and duration, their interaction, and their covariances, for every level of word. One thing that could make a big model like this tractable is to eliminate the covariance structure using the (parameter || group) syntax.

ay_bigish <- lmer(F1_n ~ decade * dur_c + 
                  (1 + dur_c | idstring) + 
                  (1 + decade * dur_c || word),
               data = ayT)
summary(ay_bigish)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ decade * dur_c + (1 + dur_c | idstring) + ((1 | word) +  
    (0 + decade | word) + (0 + dur_c | word) + (0 + decade:dur_c |      word))
   Data: ayT

REML criterion at convergence: 26118.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.4181 -0.5847 -0.0400  0.5323 10.4281 

Random effects:
 Groups   Name         Variance  Std.Dev. Corr
 idstring (Intercept)  2.805e-02 0.167467     
          dur_c        1.612e-02 0.126984 0.20
 word     (Intercept)  5.037e-02 0.224429     
 word.1   decade       4.893e-04 0.022120     
 word.2   dur_c        4.365e-03 0.066070     
 word.3   decade:dur_c 7.132e-05 0.008445     
 Residual              2.122e-01 0.460683     
Number of obs: 19342, groups:  idstring, 326; word, 255

Fixed effects:
              Estimate Std. Error t value
(Intercept)   1.117453   0.035017   31.91
decade       -0.102398   0.006079  -16.85
dur_c         0.288273   0.030780    9.37
decade:dur_c  0.004922   0.005585    0.88

Correlation of Fixed Effects:
            (Intr) decade dur_c 
decade      -0.718              
dur_c       -0.133  0.088       
decade:dr_c  0.068 -0.071 -0.779

We’ll talk more about issues like this next week when we talk about building up models.

Type I and Type II Error

These are poorly named, and I think few people remember which is which correctly the first time. These are errors related to the use of p-values. First, it’s important to define what the Null Hypothesis is.

  • The Null Hypothesis is the Dull Hypothesis: The Null Hypothesis is that there is no relationship between the outcome and predictors. It is not what you think the most likely relationship is.

With that in mind, we can define Type I and Type II errors.

  • Type I - You reject the null hypothesis (because p < 0.05), but in reality there isn’t a relationship. That is, you say there is a reliable effect when there isn’t one.
  • Type II - You fail to reject the null hypothesis when there is a real effect.

Or to put it another way, when you see a statistical result that is significant (p < 0.05), but you don’t think it could possibly be real, you think a Type I error has occurred.

And when you get p > 0.05, but you really think there’s something there so you put “trending towards significance” in your paper, you think a Type II error has occurred.

The best way to ameliorate Type I and Type II errors is to either collect more data, or to move to Bayesian Statistics, which we won’t be able to cover in this course, but I recommend this book.

Type M and Type S errors

These are error types discussed by Gelman and colleagues (and are better named).

  • Type M Errors are gross over-estimation of the magnitude of an effect size.
  • Type S Errors get the sign wrong on an estimate. That is, they predict an effect in the opposite direction of what the true effect is.

Type M might not sound like a very big problem, and Type S might sound a bit outlandish, but these can both happen, especially when we are exploring for small effects and filtering results by their p-values. Gelman has proposed that Type M errors are a likely culprit in the so-called “decline effect”, the phenomenon whereby large and significant effects gradually get smaller after their original reports.

Here are some ways to guard yourself against Type M and Type S errors.

  • Try to reason how big the effect is likely to be.
  • Try to collect estimates of how other factors affect the outcome you’re investigating, and how big those effects are.
  • Commit to the direction of the effect.

Type M and Type S errors are only going to grow in severity as the size of data sets grow.

Collinearity

“Collinearity” is when two potential predictors are highly correlated with each other. When two correlated predictors are included in the same regression, the results can be wonky at best. Gorman (2010) talks about this at length with respect to socioeconomic measures in the Language Change and Variation study. For example, speakers’ Occupation was highly correlated with the value of their Residence. You shouldn’t include these both, untreated, into your models.
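To get a concrete sense of “wonky”, here is a minimal simulation in base R. The variables `x1` and `x2` are invented stand-ins for two highly correlated measures like Occupation and Residence; including both inflates the standard error on each:

```r
# Simulated collinearity: x2 is nearly a copy of x1.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # cor(x1, x2) is around 0.99
y  <- x1 + rnorm(n)             # only x1 truly drives y

se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_both / se_alone              # the standard error blows up
```

With a correlation this high, the standard error on `x1` is roughly ten times larger when `x2` is in the model, which is why the estimates get so unstable.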

There are at least three different strategies for dealing with this issue:

  1. If you must model these factors independently, residualize (Gorman, 2010)
  2. Or, just pick the one that is theoretically motivated (Fruehwald, 2016).
  3. Or give up on treating your measures as having independent reality, realize they are reflections of more abstract dimensions, and try to figure out what those are (i.e. factor analysis, Walker (2016)).

Residualization

Like the name suggests, this involves the residuals from a linear model. If you really want to fit a model like this:

negative_concord ~ occupation + residence

Then what you need to do first is fit this model.

residence ~ occupation

And replace the residence term in the first model with the residuals of the second model. Now, it does matter which order you residualize in. How do you choose? Again, there’s no hard and fast answer, but please don’t try it every possible way and then just choose the model that looks best!
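Here is a sketch of that procedure on simulated data (the variable names are illustrative, not the LCV data, and the outcome is continuous for simplicity). The key property is that the residualized predictor is uncorrelated with the predictor it was regressed on:

```r
set.seed(1)
n <- 200
occupation <- rnorm(n)
residence  <- 0.8 * occupation + rnorm(n, sd = 0.5)   # correlated predictors
outcome    <- 0.5 * occupation + 0.3 * residence + rnorm(n)

# Step 1: regress one collinear predictor on the other.
residence_resid <- resid(lm(residence ~ occupation))

# Step 2: use the residuals in place of the original predictor.
full_mod <- lm(outcome ~ occupation + residence_resid)
cor(occupation, residence_resid)   # effectively zero
```

Because least-squares residuals are orthogonal to the regressors, `residence_resid` now only carries the part of residence that occupation can’t explain.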

Factor Analysis

Another approach would be to abandon trying to look at all of these predictors independently and to treat them all as rough measures correlated to a more abstract variable, using something like a factor analysis. This is what Walker (2016) did.

The best way to explain a factor analysis is to imagine surveying a bunch of people on the street and asking them the following questions:

  1. Did you have the heat on this morning?
  2. Are you wearing a wooly jumper?
  3. Are you wearing a scarf?
  4. Could you see your breath when you stepped outside?
  5. Did your glasses fog up when you walked inside?
  6. Did you make yourself a big breakfast?
  7. Was your stomach growling when you woke up?

This looks like you’ve asked people 7 different questions, but really, you’ve only effectively asked 2 questions:

  1. Was it cold?
  2. Were you hungry?

If you took the raw data from the 7 question survey and ran it through a factor analysis, it would (ideally) tell you that there are only 2 effective questions in it, based on how responses to questions 1 through 5 are correlated with each other, and 6 and 7 are correlated with each other.


Collinearity often comes up as a pitfall of estimating effects. But it’s also an issue when trying to understand our models. A common, annoying, and usually valid question is:

Couldn’t your effect of A actually be an effect of B?

The best way to deal with this kind of collinearity is to try to kill your effects. Does the same effect hold in the same direction for important subsets of the data? Can you subset the data according to B and fit separate models for A?
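As a sketch of that subsetting strategy (on simulated data; A, B, and y are invented names), fit the model within each level of B and check whether A’s slope keeps its sign:

```r
set.seed(1)
d <- data.frame(A = rnorm(300), B = rep(c("b1", "b2", "b3"), each = 100))
d$y <- 0.5 * d$A + rnorm(300)

# One model per subset of B: does A's slope survive in each?
slopes <- sapply(split(d, d$B), function(sub) coef(lm(y ~ A, data = sub))["A"])
slopes
```

If A’s effect holds in the same direction within every subset of B, it becomes much harder to argue that the effect of A is really a disguised effect of B.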


---
title: "More On Mixed Effects Models"
output:
  html_notebook:
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_depth: 3
    toc_float: yes
---

# Agenda

- Recapping last week
- Discussing the assumptions of LMEs
- Should variables be fixed or random effects?
- Some pitfalls.


```{r}
library(tidyverse)
library(broom)
library(ggplot2)
library(lme4)
library(multivariate2017)
```

```{r}
ayT <- ay %>% 
          filter(plt_vclass == "ay0") %>%
          mutate(dob = year-age,
                 decade = (dob - 1900)/10,
                 log2dur = log2(dur),
                 dur_c = log2dur - median(log2dur)) 

```


# Recapping LME 

Let's recap what we learned about mixed effects models last week. 
First, it is a reasonable thing to want to know what the relationship between duration and the F1 of /ay/ is:

```{r}
ayT %>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')+
    theme_bw()
```

But if we fit just a flat model, there are many different violations of statistical assumptions, and the model is overconfident.

```{r}
ayT_flat_model <- lm(F1_n ~ dur_c, data = ayT)
summary(ayT_flat_model)
```

There is a lot of *structure* to the data that is being overlooked in estimating the relationship between duration and F1. 
First of all, it doesn't look like every speaker has the same relationship.

```{r}
random_speakers <- sample(unique(ayT$idstring), 8)
ayT%>%
  filter(idstring %in% random_speakers)%>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = lm)+
    facet_wrap(~idstring, ncol= 4)+
    theme_bw()
```

Second, the data is *far* from balanced across speakers.

```{r}
ayT %>%
  count(idstring)%>%
  ggplot(aes(n))+
    stat_bin()+
    scale_x_log10()+
    xlab("n (log scale)")+
    theme_bw()
```


Let's look again at our estimates from the flat model:

```{r}
summary(ayT_flat_model)$coef
```

Those estimates are going to largely be a characterization of the relationship between duration and F1 *for the speakers with the most data*.
One way you could try to get around this problem would be to fit one linear model for every speaker.

```{r}
ayT%>%
  ggplot(aes(dur_c, F1_n))+
    stat_smooth(method = lm, 
                aes(group = idstring), 
                se = F,
                geom = "line",
                alpha = 0.4)+
    theme_bw()
```

But this isn't ideal for a few reasons. 

- We don't wind up with a precise estimate of the general effect of duration on F1.
- We don't wind up with an estimate of our uncertainty of the effect of duration on F1.
- None of the estimates of these models takes into account the estimates of any other model.

So, instead, we fit a mixed effects model with a random intercept and a random slope by speaker.

```{r}
ayT_lme <- lmer(F1_n ~ dur_c + (1 + dur_c | idstring), data = ayT)
summary(ayT_lme)
```

Here, we have the fixed effects estimates:

```{r}
summary(ayT_lme)$coef
```

This is the best *estimate across speakers* for the relationship between F1 and duration.
But we also get some valuable info about how variable *around* these estimates we can expect observed speakers to be!

```{r}
summary(ayT_lme)$varcor
```

The standard deviation of the random slopes here is fairly respectable.
Let's say there was a speaker who had *double* the slope of the group estimate.
How many standard deviations would they be from the group average?


```{r}
slope <- 0.3428392 
slope_sd <- 0.13301

((slope * 2) - slope)/slope_sd
```

Two and a half sds is fairly large, but not an extreme outlier.



# The assumptions of Mixed Effects Models

Mixed effects models relax some assumptions of fixed-effects models, namely that all observations are i.i.d. They do so by allowing for partial pooling of the data according to grouping factors.
However, mixed effects models are not *without* assumptions.

Specifically, the model assumes that *the random effects are drawn from a normal distribution*.
The clue that the model is making this assumption is that it gives us estimates of the standard deviation of the random effects, which is a property of normal distributions.

This assumption has a good effect when it comes to predicting the slopes and intercepts for individual speakers.
Last week, we looked at the results of fitting one linear model for each speaker.
Here's the density plots of the distribution of all of the slopes and intercepts from all of those models.

```{r echo = F}
ayT %>% 
  group_by(idstring)%>%
  nest()%>%
  mutate(model = map(data, ~lm(F1_n ~ dur_c, data = .x)),
         params = map(model, tidy),
         n = map(data, nrow) %>% simplify())%>%
  unnest(params)%>%
  select(idstring, term, estimate, n)%>%
  spread(term, estimate) -> params

ayT_lme_fixef <- fixef(ayT_lme)

```

```{r echo = F}
params %>%
  gather(param, est, `(Intercept)`:dur_c)->param_long
param_long %>% 
  ggplot(aes(est))+
    stat_density(alpha = 0.8, color = "black")+
    facet_wrap(~param, scales = "free")+
    xlab("estimate")+
    theme_bw()
```



Now, let's compare these by-speaker intercepts and slopes (in grey) to the random effects from our linear mixed effects model (in yellow).
```{r echo = F}
library(ggthemes)
ranef(ayT_lme)$idstring %>%
  mutate(`(Intercept)` = `(Intercept)`+ ayT_lme_fixef[1],
          dur_c = dur_c + ayT_lme_fixef[2],
         estimator = "lme")%>%
  gather(param, est, `(Intercept)`:dur_c)%>%
  bind_rows(param_long %>% mutate(estimator = "lm"))%>%
  ggplot(aes(est))+
    geom_density(aes(fill = estimator), 
                 position = "identity",
                 alpha = 0.6)+
    facet_wrap(~param, scales="free")+
    scale_fill_colorblind()+
    theme_bw()
```


You can see that for both the intercept and the slope, but especially the slope, the range of interspeaker values has been sucked in.
This is also known as "shrinkage."
If we were interested in how different the effect of duration could be between speakers, this would suggest that simply fitting a model for every speaker would *overestimate* the variance!

Shrinkage is desirable for a few reasons.
The first is that it is often the case that extreme outlier subjects will be less extreme on a second testing (i.e. regression to the mean). 
So, it makes sense to moderate the estimates of outlier speakers.
The second is that we ought to have higher uncertainty about many of the speakers who are estimated to have an extreme duration effect than about those speakers whose slopes are closer to the group average.
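The amount of shrinkage can be sketched by hand for the simple random-intercept case: a speaker's predicted mean is a precision-weighted compromise between their own sample mean and the grand mean, with weight $\lambda_i = \tau^2 / (\tau^2 + \sigma^2 / n_i)$, so speakers with fewer tokens get pulled in harder. All the numbers below are invented for illustration.

```r
tau2   <- 0.35^2     # between-speaker variance (illustrative)
sigma2 <- 0.46^2     # residual variance (illustrative)
grand_mean   <- 0.59
speaker_mean <- 1.20 # an extreme-looking speaker

shrunken_mean <- function(n_i) {
  lambda <- tau2 / (tau2 + sigma2 / n_i)   # weight on the speaker's own mean
  grand_mean + lambda * (speaker_mean - grand_mean)
}

shrunken_mean(5)     # few tokens: pulled well back toward the grand mean
shrunken_mean(100)   # lots of tokens: stays close to 1.20
```

This is why the sparsely-sampled speakers move the most in the comparisons below.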


To demonstrate this point, here are the six speakers who had the biggest difference between the *Maximum Likelihood* estimate of their slopes (that is, from fitting a model for each speaker) and their predicted slopes from the random effects.
The Maximum Likelihood estimate is the dashed black line, and the random effects predictions are in the red lines.


```{r echo = F}
ayT_lme_ranef <- ranef(ayT_lme)$idstring
ayT_lme_ranef$idstring <- rownames(ayT_lme_ranef)
ayT_lme_ranef <- ayT_lme_ranef %>% 
                    mutate(`(Intercept)` = `(Intercept)` + ayT_lme_fixef[1],
                           dur_c = dur_c + ayT_lme_fixef[2])
ayT_lme_ranef_long <- ayT_lme_ranef %>% 
                          gather(param, est, 1:2)%>%
                          mutate(model = "lme")
param_long %>% 
  mutate(model = "lm")%>%
  bind_rows(ayT_lme_ranef_long)%>%
  select(idstring, param, est, model)%>%
  spread(model, est)%>%
  filter(param == 'dur_c')%>%
  mutate(diff = abs(lm-lme))%>%
  arrange(-diff) %>%
  slice(1:6)->extrema_diff


param_long %>% 
  mutate(model = "lm")%>%
  bind_rows(ayT_lme_ranef_long)%>%
  select(idstring, param, est, model)%>%
  spread(model, est)%>%
  filter(param == 'dur_c')%>%
  mutate(diff = abs(lm-lme))%>%
  arrange(diff) %>%
  slice(1:6)->extrema_same

ayT %>%
  filter(idstring %in% extrema_diff$idstring)%>%
  mutate(idstring = factor(idstring, levels = extrema_diff$idstring))%>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    geom_abline(data = params %>% filter(idstring %in% extrema_diff$idstring),
                aes(slope = dur_c,
                    intercept = `(Intercept)`),
                linetype = 2)+
    geom_abline(data = ayT_lme_ranef %>% filter(idstring %in% extrema_diff$idstring),
                aes(slope = dur_c,
                    intercept = `(Intercept)`),
                color = "red")+  
    facet_wrap(~idstring)+
    theme_bw()

```

Compare this to the speakers whose ML estimates are the most similar to the random effects predictions.

```{r echo = F}
ayT %>%
  filter(idstring %in% extrema_same$idstring)%>%
  mutate(idstring = factor(idstring, levels = extrema_same$idstring))%>%
  ggplot(aes(dur_c, F1_n))+
    geom_point()+
    geom_abline(data = params %>% filter(idstring %in% extrema_same$idstring),
                aes(slope = dur_c,
                    intercept = `(Intercept)`),
                linetype = 2)+
    geom_abline(data = ayT_lme_ranef %>% filter(idstring %in% extrema_same$idstring),
                aes(slope = dur_c,
                    intercept = `(Intercept)`),
                color = "red")+  
    facet_wrap(~idstring)+
    theme_bw()
```


The assumption that the random effects are drawn from a normal distribution is great because 

1. Shrinkage is good.
2. It makes their estimation tractable on most laptops!

But as with all statistical *assumptions*, some caution should be taken where they may be violated.
The most major, and likely most common, violation of this normality assumption is if all of the levels of your grouping factor *aren't* drawn from a single distribution!
For example, here is some simulated data from 6 speakers. Speakers `a`, `b` and `c` come from a group distribution with mean 0, and speakers `d`, `e` and `f` come from a group distribution with mean 4.


```{r tidy = FALSE}
speakers <- letters[1:6]
speaker_means <- c(rnorm(3, mean = 0, sd = 1),
                   rnorm(3, mean = 4, sd = 1))
```
```{r}
speaker_samps <- map(speaker_means, ~rnorm(50, mean = .x, sd = 1)) %>% simplify()
```
```{r}
fake_speakers <- data.frame(speaker = rep(speakers, each = 50),
                            group = rep(c("x", "y"), each = (50 * 3)),
                            outcome = speaker_samps)
```
```{r}
ggplot(fake_speakers, aes(outcome, fill = group))+
  stat_bin(position = "identity", alpha = 0.8)+
  scale_fill_colorblind()+
  theme_bw()
```

We *could* fit a mixed effects model with just a random intercept by speaker.

```{r}
fake_lme_1 <- lmer(outcome ~ (1|speaker), data = fake_speakers)
summary(fake_lme_1)
```

<div class = "box break">
<span class = "big-label">Question</span>

What's the problem?
</div>

The LME is estimating just 1 variance parameter for all of the speakers, thus assuming all speakers are drawn from 1 distribution, but there are actually *two* distributions of speakers here!

There is a fairly straightforward way to address this violation of the LME assumption, and that's to include the relevant grouping predictor in the model!

```{r}
fake_lme_2 <- lmer(outcome ~ group + (1|speaker), 
                   data = fake_speakers)
```
```{r}
summary(fake_lme_2)
```

Now that we have the relevant between-speaker predictor included in the model, notice how much the variance of the random-intercepts has decreased.

This brings up an important point: if there is a crucial between-speakers, between-words, between-subject, between-items, between-whatever predictor, *include it in the model*.

Returning to the real data example of `ayT`, there is one major interspeaker variable which was not included in the model -- date of birth.
Plotting F1 against both decade and duration, there seem to be reliable effects.
But because date of birth is a between-speakers predictor, when it wasn't included in the model, this between speaker variation was accounted for by by-speaker random intercepts. 

<div style = "width:100%;float:left;">
<div style = "width:45%;float:left;margin-left:auto;margin-right:auto;">
```{r echo = F}
ggplot(ayT, aes(decade, F1_n))+
    geom_point()+
    theme_bw()+
    ggtitle("decade")
```
</div>
<div style = "width:45%; float:left;margin-left:auto;margin-right:auto;">
```{r echo = F}
ggplot(ayT, aes(dur_c, F1_n))+
    geom_point()+
    theme_bw()+
    ggtitle("duration")
```
</div>
</div>

```{r}
ayT_lme_multi <- lmer(F1_n ~ decade * dur_c + (1 + dur_c | idstring),
                      data = ayT)
summary(ayT_lme_multi)
```



## Violations of assumptions that *can't* be accommodated

There are a few circumstances that might (definitely) arise in your actual data that violate the assumptions of mixed effects models that we can't do anything about within the framework.

### Different group variances

First, let's revisit *why* adding the between-speaker grouping predictor was effective in satisfying the normality assumption.

```{r}
fake_lme_2
```

The way to think about the random intercepts is that after taking into account all of the fixed effects, they're how much remains to be explained by the pooling variables.
We can visualize how the random effects structure "sees" the data by simply subtracting out the fixed effect of group `y` from their outcome data.

```{r}
fake_speakers$outcome2 <- fake_speakers$outcome
fake_speakers$outcome2[fake_speakers$group == "y"] <- fake_speakers$outcome[fake_speakers$group == "y"] - fixef(fake_lme_2)[2]
```

```{r}
ggplot(fake_speakers, aes(outcome2))+
  stat_bin(aes(fill = group), position = "identity", alpha= 0.6)+
  scale_fill_colorblind()+
  theme_bw()
```

By subtracting out the effect of group `y`, we align the distributions of data from groups `x` and `y` to be on top of each other, and then the random effects are estimated as if this were just one distribution.

**But**, two groups of speakers are able to differ in more than just their means!
For example, it's possible that these two groups of speakers differ not only in their means, but also in their variances. 


```{r}
speakers <- letters[1:6]
speaker_means_2 <- c(rnorm(3, mean = 0, sd = 1),
                   rnorm(3, mean = 4, sd = 3))
# Here, speaker group 2 is sampled from a distribution with sd = 3
# instead of sd = 1, like it was in the first example.
```
```{r}
speaker_samps_2 <- map(speaker_means_2, ~rnorm(50, mean = .x, sd = 1)) %>% simplify()
```
```{r echo  = F}
fake_speakers_2 <- data.frame(speaker = rep(speakers, each = 50),
                            group = rep(c("x", "y"), each = (50 * 3)),
                            outcome = speaker_samps_2)
```
```{r echo  = F}
ggplot(fake_speakers_2, aes(outcome, fill = group))+
  stat_bin(position = "identity", alpha = 0.8)+
  scale_fill_colorblind()+
  theme_bw()
```

You can fit an LME to this data, but the fact that group `y` has a much larger variance than group `x` won't be accounted for. 
The variance estimates for the random intercepts will wind up splitting the difference between the actual variances of groups `x` and `y`.

```{r}
summary(lmer(outcome ~ group + (1|speaker), data = fake_speakers_2))
```

### Different variances within grouping factors

Let's fit a random intercepts only model to the `ayT` data.

```{r}
ayT_intercept_mod <- lmer(F1_n ~ (1|idstring), data = ayT)
summary(ayT_intercept_mod)
```

Let's think about how this model is modelling the data.
First, it's assuming that the speaker random intercepts are being drawn from a normal distribution.
The mean of that distribution is the fixed effects intercept estimate, `r round(fixef(ayT_intercept_mod)[1], digits = 2)`. 
The standard deviation of this distribution is estimated to be 0.35.
So, we could treat individual speakers' means as being drawn from a normal distribution with $\mu = 0.59$ and $\sigma=0.35$.

![](figures/speaker_variances.svg)

But, every speaker contributed multiple data points to the entire data set.
In addition to a $\mu$, every speaker must also have a $\sigma$. 
Does the LME estimate a $\sigma$ for each speaker?

In fact, the standard deviation of the *residuals* could be thought of as the $\sigma$ for each speaker. 
So, no, the LME does not estimate a $\sigma$ for every speaker.
Rather, it estimates *1* $\sigma$ for all speakers. 

Now, strictly speaking, this assumption that every speaker will have the same $\sigma$ is almost certainly not going to be true! 
Here's what it looks like if we estimate the standard deviation for every speaker with a reasonable amount of data (to make sure we're getting fairly reliable estimates), compared to the single standard deviation of the residuals from the LME.



```{r}
ayT %>%
  group_by(idstring) %>%
  filter(n() > 40) %>%
  summarise(mean_F1 = mean(F1_n),
            sd_F1 = sd(F1_n),
            n = n()) %>%
  ggplot(aes(mean_F1, sd_F1))+
    geom_hline(yintercept=0.5160, 
               color = "red",
               size = 3)+
    geom_point(aes(size = n),
               alpha = 0.6)+
    scale_size_area()+
    theme_bw()
```

It could be of some interest to capture the fact that not all speakers have the same $\sigma$, and it would definitely improve the fit of the model, but this just isn't something that can be captured with LMEs.

### Non-normal data

Just like with ordinary linear regression, if your outcome variable is non-normal, it's technically going to violate the assumptions of normality.
For example, what if, instead of fitting a model where we tried to estimate how the pronunciation of /ayT/ has changed, we tried to fit one to see whether the *duration* of /ayT/ has changed over time?

```{r}
summary(lmer(dur ~ decade + (1|idstring), data = ayT))
```

Here, we get a fairly reliable effect (looks like /ayT/ has gotten shorter over time), but recall that the distribution of duration looks like this!

<div class = "half-img">
```{r echo = F, fig.width=2, fig.height = 2}
ggplot(ayT, aes(dur))+
  stat_density()+
  theme_bw()
```
</div>


It's likely that neither the by-speaker random intercepts nor the residuals will have a normal distribution. 
This is actually one scenario you can get around because it's principled enough to log-scale the duration outcome, but there won't always be a principled transform of skewed data like this.
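As a quick illustration of why log-scaling helps (using simulated log-normal durations rather than the real `dur` values), compare the sample skewness of the raw and log2-transformed data:

```r
set.seed(1)
dur_sim <- rlnorm(1000, meanlog = -2, sdlog = 0.5)   # right-skewed, duration-like

# Sample skewness: third standardized moment.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(dur_sim)         # strongly positive
skewness(log2(dur_sim))   # close to zero
```

The raw data is strongly right-skewed, while the log2 data is close to symmetric, which is exactly what the normality assumptions want.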





# Fixed Effect or Random Effect?

Before even getting to how to specify a random effects structure, people often also wonder which predictors should be included as a fixed effect, and which should be included as a random effect. 
For example, when looking at the relationship between duration and F1 for /ayT/, let's suppose we had the following suspicion:

> I think that different speakers might have different slopes for the relationship between duration and F1.

There are two different ways we could approach this

1. Fit a linear model with an interaction between duration and speaker.
2. Fit a linear mixed effects model with a random slope of duration by speaker.

Let's just look at the first option briefly:

```{r}
ayT_lm_interaction <- lm(F1_n ~ dur_c * idstring, data = ayT)
```

```{r}
summary(ayT_lm_interaction)
```

I've actually left the summary of this model out of the notes, because it's too long to be of any use. 
What we wind up with is an intercept and slope estimate for some reference speaker, and then the difference of every other speaker's intercept and slope from that reference speaker.
The weirdness of choosing a reference level speaker can be ameliorated by using sum-contrasts, but that doesn't address a more serious issue: this approach will wind up giving us estimates *identical* to the no-pooling method.

Each speaker's coefficient and slope interaction is estimated without respect to the estimates of any other speaker. 
And that comes with all of the same shortcomings we dealt with in the no-pooling model.


On the other hand, let's return to the broader `ay` data set.
Recall we've been looking at /ay/ in strictly pre-voiceless context. 
But in the pre-voiced context, the relationship between F1 and the speakers' date of birth looks a lot different.

```{r}
ay <-  ay %>% mutate(dob = year-age, 
                    decade = (dob - 1900)/10)
ay %>%
  group_by(idstring, dob, plt_vclass)%>%
  summarise(F1_n = mean(F1_n)) %>%
  ggplot(aes(dob, F1_n, color = plt_vclass))+
    geom_point()+
    scale_color_colorblind()+
    theme_bw()
```

Looking at the graph, we might say:

> I suspect that the relationship between F1 and decade will differ between pre-voiced and pre-voiceless contexts.

The way to model this would be to either:

1. Fit a model with a fixed effects interaction between decade and voicing.
2. Fit a model with a random slope of decade by voicing.

Just for illustration, let's look at a model of this second approach:

```{r}
ay_lmer_slope <- lmer(F1_n ~ decade + (1|idstring) + (1 + decade | plt_vclass),
                      data = ay)

summary(ay_lmer_slope)
```

We get out one fixed effects intercept and slope, and an estimate of the variance of the random slopes by voicing context.
We can also look at the random effects predictions:

```{r}
ranef(ay_lmer_slope)$plt_vclass
```

One thing that isn't so great about this result is that shrinkage will have applied to these random slopes!
That is, the predicted slopes for pre-voiced and pre-voiceless /ay/ are going to be less different here than they would be as fixed effects.

-------

We can sum up the motivation for including a factor as a fixed or random effect based on whether or not we want shrinkage, or partial pooling, to apply between levels of the factor. 
Or, another way of putting it is whether or not we want to treat the parameter values for each level of the factor as being drawn from a population-level distribution. 
That is, the decision is *always* going to be guided by your conceptualization of the data, even though there are usually decisions that can be made by convention. Almost always, the following can be treated as random effects:

- individuals (speakers, subjects, texts, authors, etc.)
- items (words, specific stimuli, etc.)

And most things that you want to look at for a generalizable effect can be fixed effects.

## "Nuisance variables"

Another way to think about random effects is as "nuisance variables".
When modelling something as complex as why pronunciations change, there are always going to be factors missing from our models for a variety of reasons.

- there are things we know matter, but we failed to measure them for some reason.
- there are things we know matter, but we don't know *how* to measure them.
- there are things that may not be *measurable*.
- there are things we don't realize matter.

If we don't do our best to account for these "unmodellables", they could very well mess with our fixed effects estimates. 
While they are not a perfect solution, random effects can go some way towards soaking up some of these unmodelable variables.

On the flip side, if you have something you *know* has an effect, include it in the model. 
For example, we know that gender has a big effect on many changes in progress.
You should, therefore, include it in the model instead of hoping that the random effects will account for it.
That's because if there is a big between-subject variable that affects the outcome variable, the assumption that speaker random intercepts & slopes are drawn from a single normal distribution is violated.


# Pitfalls in mixed effects modelling

The biggest issue people face when fitting a mixed effects model is that it may "fail to converge". This is because these models aren't 'solved' in the same way as a regular linear regression is.
Because of their more complicated nature, these models are usually estimated via a process of iterative optimization.

It is very important to only base your inference on converged models. There are a handful of tweaks that might help your models converge:

- center and scale all continuous predictors.
- if possible, choose a different factor level that has more data as your reference level.
- remove the correlation component from your model


Centering and scaling your continuous predictors is very important here. 
Let's look at a relatively simple model which is trying to predict F1 by date of birth, with a random intercept by speaker, and a random intercept and slope by word.

```{r}
ay_fail <- lmer(F1_n ~ dob + (1 | idstring) + (1 + dob | word),
                data = ayT)
```
```{r}
summary(ay_fail)
```

This is *bad*.
But remember, the `dob` variable is looking at the data like this:

<div class = "half-img">
```{r, echo = F}
ggplot(ayT, aes(dob, F1_n))+
    geom_point()+
    geom_abline(aes(intercept = 23.4652160,
                    slope = -0.0116823),
                color = "red")+
    xlim(0, 1998)+
    ylim(-2, 23)+
    theme_bw()
```
</div>

This can pose a challenge for trying to estimate random intercepts and slopes! If we refit the model, this time with the centered and scaled `decade` predictor, it'll converge.

```{r}
ay_win <- lmer(F1_n ~ decade + (1 | idstring) + (1 + decade | word),
                data = ayT)
```
```{r}
summary(ay_win)
```

But sometimes, no matter how you scale or center your predictors, the random effects structure might be too complex given the data.
I tried fitting a model like this with the /ayT/ data, but it actually converged:


```{r}
ay_big <- lmer(F1_n ~ decade * dur_c + 
                  (1 + dur_c | idstring) + 
                  (1 + decade * dur_c | word),
               data = ayT)
```

```{r}
summary(ay_big)
```

But often, a model like this will fail to converge, and it's easy enough to understand why.
For many levels of `word`, there is only 1 token, but the model is trying to estimate a random intercept, slopes for decade and duration, their interaction, and their covariances, for *every* level of `word`.
One thing that could make a big model like this tractable is to eliminate the covariance structure using the `(parameter || group)` syntax.

```{r}
ay_bigish <- lmer(F1_n ~ decade * dur_c + 
                  (1 + dur_c | idstring) + 
                  (1 + decade * dur_c || word),
               data = ayT)
```
```{r}
summary(ay_bigish)
```


We'll talk more about issues like this next week when we talk about building up models.


## Type I and Type II Error

These are poorly named, and I think few people remember which is which correctly the first time. These are errors related to the use of p-values. First it's important to define what the **Null Hypothesis** is.

- **The Null Hypothesis is the Dull Hypothesis:** The Null Hypothesis is that there is no relationship between the outcome and predictors. It is *not* what you think the most likely relationship is.

With that in mind, we can define Type I and Type II errors.

- **Type I** - You reject the null hypothesis (because p < 0.05), but in reality there *isn't* a relationship. That is, you say there is a reliable effect when there isn't one.
- **Type II** - You fail to reject the null hypothesis when there *is* a real effect. 

Or to put it another way, when you see a statistical result that is significant (p < 0.05), but you don't think it could possibly be real, you think a Type I error has occurred.

And when you get p > 0.05, but you really think there's something there so you put "trending towards significance" in your paper, you think a Type II error has occurred.

The best ways to ameliorate Type I and Type II errors are to collect more data, or to move to Bayesian statistics, which we won't be able to cover in this course, [but I recommend this book](https://sites.google.com/site/doingbayesiandataanalysis/).
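The 5% Type I error rate implied by the p < 0.05 threshold can be seen directly in a quick simulation (the data here is entirely made up):

```{r}
# A simulation of Type I errors: x and y are unrelated, so any
# "significant" result here is a false positive.
set.seed(517)
p_values <- replicate(1000, {
  x <- rnorm(50)
  y <- rnorm(50)  # y does not depend on x at all
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
# About 5% of these null experiments come out significant at p < 0.05
mean(p_values < 0.05)
```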



 
 
## Type M and Type S errors

These are error types [discussed by Gelman and colleagues](http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf) (and are better named).

 - **Type M Errors** are gross over-estimation of the *magnitude* of an effect size.
 - **Type S Errors** get the *sign* wrong on an estimate. That is, they predict an effect in the opposite direction of what the true effect is.
 
Type M might not sound like a very big problem, and Type S might sound a bit outlandish, but both can happen, especially when we are [exploring for small effects and filtering results by their p-values](http://rpubs.com/JoFrhwld/291734). Gelman has proposed that [Type M errors are a likely culprit in the so-called "decline effect"](http://andrewgelman.com/2010/12/13/the_truth_wears/), the phenomenon whereby large, significant effects seem to get smaller after their original reports.

Here are some ways to guard yourself against Type M and Type S errors.

- Try to reason how big the effect is *likely* to be.
- Try to collect information on how other factors affect the outcome you're investigating, and how big those effects are.
- Commit to the *direction* of the effect.

Type M and Type S errors are only going to grow in severity as [the size of data sets grow](http://jofrhwld.github.io/papers/plc39_2015/#/).
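To make this concrete, here's a small simulation (made-up data) of an underpowered study: a small true effect, filtered by p < 0.05, yields exaggerated (Type M) and occasionally sign-flipped (Type S) estimates.

```{r}
# An underpowered study of a small true effect (slope = 0.1), keeping
# only the "significant" estimates, as filtering by p-values does.
set.seed(290)
results <- replicate(5000, {
  x <- rnorm(20)
  y <- 0.1 * x + rnorm(20)  # true slope is 0.1, with lots of noise
  coefs <- summary(lm(y ~ x))$coefficients
  c(est = coefs["x", "Estimate"], p = coefs["x", "Pr(>|t|)"])
})
significant <- results["est", results["p", ] < 0.05]
# Type M: the significant estimates greatly exaggerate the true slope of 0.1
mean(abs(significant))
# Type S: a portion of the significant estimates even have the wrong sign
mean(significant < 0)
```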



## Collinearity

"Collinearity" is when two potential predictors are highly correlated with each other. When two correlated predictors are included in the same regression, the results can be *wonky* at best. [Gorman (2010)](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1146&context=pwpl) talks about this at length with respect to socioeconomic measures in the Language Change and Variation study. For example, speakers' Occupation was highly correlated with the value of their Residence. You *shouldn't* include both of these, untreated, in your models. 
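Here's what that wonkiness looks like in a simulation (the variables `occ` and `res` are made-up stand-ins for Occupation and Residence, not the LCV data):

```{r}
# Two highly correlated predictors, where only occ truly drives the outcome.
set.seed(1146)
occ <- rnorm(100)
res <- occ + rnorm(100, sd = 0.1)  # res is nearly a copy of occ
outcome <- occ + rnorm(100)
# On its own, occ looks strongly related to the outcome...
coef(summary(lm(outcome ~ occ)))["occ", c("Estimate", "Std. Error")]
# ...but with res in the model too, the standard errors balloon
# and the two estimates wobble between the predictors.
coef(summary(lm(outcome ~ occ + res)))[c("occ", "res"), c("Estimate", "Std. Error")]
```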


There are at least three different strategies for dealing with this issue:

1. If you *must* model these factors independently, residualize (Gorman, 2010)
2. *Or*, just pick the one that is theoretically motivated (Fruehwald, 2016).
3. *Or* give up on treating your measures as having independent reality, recognize that they are reflections of more abstract dimensions, and try to figure out what those are (i.e. factor analysis; Walker, 2016).


### Residualization

Like the name suggests, this involves the residuals from a linear model. If you really want to fit a model like this:

```
negative_concord ~ occupation + residence
```

Then what you need to do first is fit this model.

```
residence ~ occupation
```

And replace the `residence` term in the first model with the *residuals* of the second model. Now, it does matter which order you residualize in. How do you choose? Again, there's no hard and fast answer, but please don't try it every possible way and then just choose the model that looks best!
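Here's a sketch of that procedure with simulated data (the variable names `occupation`, `residence`, and `outcome` are just illustrative stand-ins; nothing here comes from the LCV study itself):

```{r}
# Residualization sketch on simulated data.
set.seed(2010)
occupation <- rnorm(200)
residence <- 0.8 * occupation + rnorm(200, sd = 0.3)  # correlated predictors
outcome <- occupation + 0.5 * residence + rnorm(200)
# Step 1: regress residence on occupation...
res_resid <- resid(lm(residence ~ occupation))
# Step 2: ...and use the residuals (the part of residence that is NOT
# predictable from occupation) in place of residence itself.
model <- lm(outcome ~ occupation + res_resid)
summary(model)
```

Because the residuals are uncorrelated with `occupation` by construction, the two predictors no longer compete, though the `occupation` coefficient now absorbs all of the variance the two measures share.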

### Factor Analysis

Another approach would be to abandon trying to look at all of these predictors independently and to treat them all as rough measures correlated to a more abstract variable, using something like a factor analysis. This is what [Walker (2016)](https://repository.upenn.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1940&context=pwpl) did.

The best way to explain a factor analysis is to imagine surveying a bunch of people on the street and asking them the following questions:

1. Did you have the heat on this morning?
2. Are you wearing a wooly jumper?
3. Are you wearing a scarf?
4. Could you see your breath when you stepped outside?
5. Did your glasses fog up when you walked inside?
6. Did you make yourself a big breakfast?
7. Was your stomach growling when you woke up?


This looks like you've asked people 7 different questions, but really, you've only effectively asked 2 questions: 

1. Was it cold?
2. Were you hungry?

If you took the raw data from the 7 question survey and ran it through a factor analysis, it would (ideally) tell you that there are only 2 effective questions in it, based on how responses to questions 1 through 5 are correlated with each other, and responses to 6 and 7 are correlated with each other.
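A sketch of what this would look like in R, with the 7-question survey simulated from the two latent variables (all the data here is invented to match the example):

```{r}
# Simulate the survey: questions 1-5 are driven by a "cold" factor,
# questions 6-7 by a "hungry" factor.
set.seed(7)
n <- 500
cold   <- rnorm(n)
hungry <- rnorm(n)
survey <- data.frame(
  q1 = cold + rnorm(n, sd = 0.5),
  q2 = cold + rnorm(n, sd = 0.5),
  q3 = cold + rnorm(n, sd = 0.5),
  q4 = cold + rnorm(n, sd = 0.5),
  q5 = cold + rnorm(n, sd = 0.5),
  q6 = hungry + rnorm(n, sd = 0.5),
  q7 = hungry + rnorm(n, sd = 0.5)
)
# factanal() should recover 2 factors: q1-q5 load on one, q6-q7 on the other
factanal(survey, factors = 2)
```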


---


Collinearity often comes up as a pitfall of estimating effects. But it's also an issue when trying to understand our models. A common, annoying, and usually valid question is:

> Couldn't your effect of A *actually* be an effect of B?

The best way to deal with this kind of collinearity is to try to kill your effects. Does the same effect hold in the same direction for important subsets of the data? Can you subset the data according to B and fit separate models for A?
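One way to sketch the subsetting strategy (with simulated data; `A` stands in for the predictor you care about, `B` for the correlated confound someone raises):

```{r}
# Does the A effect hold within each level of B? (simulated data)
set.seed(42)
B <- rep(c("low", "high"), each = 100)
A <- rnorm(200) + (B == "high")  # A is correlated with B
y <- 0.5 * A + rnorm(200)       # but the true effect belongs to A
# Fit the same model separately within each subset defined by B
subset_effects <- sapply(split(data.frame(A, y), B),
                         function(d) coef(lm(y ~ A, data = d))["A"])
subset_effects
```

If the estimate holds in the same direction in every subset, "it's actually B" becomes much harder to argue.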



------



